South Park is an American TV show. It is well known for being very satirical. Pretty much every famous person has already been made fun of in the series. I literally watch it every day! I also do lots of analyses in R every day. I just thought to myself, why haven’t I analysed South Park texts yet? And that’s when I decided to combine two things I am passionate about. Read on to see how easy it was!

So I have the idea, but where do I start?

First things first. I had to find a resource with all the text in a reasonable format. It took just a bit of Googling to find a South Park gold mine! I typed “South Park scripts” into Google and the very first link was exactly what I was looking for: the South Park Archives, a page with community-maintained scripts for all episodes! Isn’t that great?

You can find a list of seasons on that page. And after clicking on a season, an episode list comes up. An episode page contains a nice table with two columns. The first column is a character name. And the second column is the actual line that character said. That’s a perfect start.
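To make the two-column layout concrete, here is a minimal sketch of how such a table could be read into R with the rvest package. The HTML snippet, the made-up lines and the column names `character` and `text` are my own stand-ins for illustration; the real page markup may well differ.

```r
library(rvest)

# Hypothetical stand-in for an episode page: a two-column table with a
# character name and the line that character said (invented lines).
page <- minimal_html('
  <table>
    <tr><td>Stan</td><td>Hello there.</td></tr>
    <tr><td>Kyle</td><td>Good morning.</td></tr>
  </table>
')

# Pull the first table out of the page as a data frame
script <- html_table(html_element(page, "table"), header = FALSE)

# Give the two columns descriptive names (assumed names, not from the site)
names(script) <- c("character", "text")

script
```

For a real episode you would pass the page address to `read_html()` instead of building the page by hand.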

There was one last thing I wanted to know about each episode: its popularity! I’m sure that you know IMDB, the Internet Movie Database. It contains ratings for all movies and TV shows as well.

But how to put it all together? I wrote a simple R package called southparkr that anyone can use to do their own analyses!

Data acquired. BINGO! Let’s dig in.

The second step was to decide what exactly I wanted to analyse. I settled on two things:

  1. Sentiment analysis of episodes,
  2. Episode popularity based on IMDB ratings.

We’ll get to that in a minute. We should first have a look at the data we acquired. Have a look at the following table. It summarises all episodes in a few numbers.

Number of seasons: 21
Number of episodes: 287
Number of words: 907 797
Number of words without stop words (a, the, this, …): 310 759
% of words used for analysis: 34.23
Average IMDB rating: 8.14
Best episode (9.6): Scott Tenorman Must Die S05E04
Worst episode (6.3): Funnybot S15E02

You can see that the show has been on for 21 seasons already. All the characters combined have said almost 1 million words! That is if we count all words. If we exclude stop words, we end up with about 300 thousand words. Stop words are prepositions, articles and other very common words that carry little meaning on their own.
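Here is a small sketch of how stop words could be removed, assuming the tidytext package and its built-in `stop_words` list (I am not claiming this is the exact list used for the numbers above). The two example lines are invented.

```r
library(dplyr)
library(tidytext)

# Two invented lines, for illustration only
lines <- tibble(
  character = c("Stan", "Kyle"),
  text = c("This is the worst day of my life",
           "I think we should go to the school")
)

# One row per word, lower-cased, punctuation stripped
words <- lines %>% unnest_tokens(word, text)

# Drop every word that appears in tidytext's stop_words data set
content_words <- words %>% anti_join(stop_words, by = "word")

# Share of words that survive the filter, in percent
share_kept <- 100 * nrow(content_words) / nrow(words)

# The same calculation with the totals from the table above
round(100 * 310759 / 907797, 2)  # 34.23
```

The last line reproduces the percentage from the summary table: 310 759 out of 907 797 words is 34.23 %.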

Across all episodes the average rating is roughly 8.1, which is great! It seems that the show is popular. I always consider anything rated above 8 very watchable! You can also see the best and the worst episode. So in case you don’t know the show, the best episode is where you might start. It is almost guaranteed that you won’t be disappointed.

Let’s dig deeper and get sentimental!

We’ll tackle the first analysis now: the sentiment analysis of South Park episodes. Sentiment analysis is a type of text analysis that assigns each word a score. Scores can be positive or negative and can be expressed as numbers or as labels. We will be using the AFINN dictionary, which scores words from -5 to +5, where -5 is very negative, 0 is neutral and +5 is very positive.

For example, “bastard” scores -5 and “thrilled” scores +5!
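A minimal sketch of how this kind of scoring could work, assuming the episode score is the mean of the word scores. The tiny lexicon below only mimics the AFINN layout: “bastard” and “thrilled” are the two examples mentioned here, while the other entries, the `value` column name and the episode words are made up for illustration.

```r
library(dplyr)

# Toy lexicon in a word/score layout. Only "bastard" and "thrilled"
# come from the text; the remaining scores are illustrative.
lexicon <- tibble(
  word  = c("bastard", "thrilled", "good", "bad"),
  value = c(-5, 5, 3, -3)
)

# Invented words from two hypothetical episodes
episode_words <- tibble(
  episode_number = c(1, 1, 1, 2, 2, 2),
  word = c("bastard", "good", "school", "thrilled", "bad", "bad")
)

by_episode <- episode_words %>%
  inner_join(lexicon, by = "word") %>%   # keeps only words that have a score
  group_by(episode_number) %>%
  summarise(mean_sentiment_score = mean(value), .groups = "drop")

by_episode
# episode 1: (-5 + 3) / 2     = -1
# episode 2: (5 - 3 - 3) / 3  = -0.333...
```

Words without a score, such as “school”, simply drop out of the join and do not affect the mean.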

All of this has been prepared for you behind the curtain. You will now see a few lines of code in R that show you a sentiment score of all episodes.

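library(ggplot2)
library(plotly)

# One bar per episode; the `text` aesthetic feeds the hover tooltip
gg <- ggplot(by_episode, aes(x = episode_number, y = mean_sentiment_score, group = 1, text = text_sent)) +
    geom_col(color = "#592a88") +
    geom_smooth()  # smoothed trend line

# Turn the static ggplot into an interactive plotly chart
ggplotly(gg, tooltip = "text")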

This creates an interactive plot! You can hover over the bars to see some information. Each bar is an episode, and hovering over it shows the episode name, number and sentiment score.

It’s just a few lines of code and the result is great! And above all, it is almost like writing an English sentence. This is what R programming looks like when you use the Tidyverse suite of packages.

You can see that most of the episodes have the bar pointing down, below zero. That’s mostly because the characters aren’t afraid to use dirty words. And they do it quite a lot!

You might also notice a blue line in the plot. It shows the trend in sentiment over time. The score rose considerably over the early episodes, peaked roughly around episode 80 and then started falling again. In other words, the language used in the show changes over time.
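If you are curious where such a trend line comes from: called without arguments, geom_smooth() fits a local regression (loess) when there are fewer than 1,000 observations. The sketch below does the same by hand on synthetic scores that I generated purely for illustration; they are not the South Park data.

```r
# Synthetic scores, for illustration only (not the real episode data)
set.seed(1)
episode_number <- 1:100
mean_sentiment_score <- -0.5 + 0.3 * sin(episode_number / 30) +
  rnorm(100, sd = 0.2)

# Local regression of the score on the episode number
fit <- loess(mean_sentiment_score ~ episode_number)

# Smoothed score for every episode
trend <- predict(fit)

# Episode at which the smoothed score is highest
peak_episode <- episode_number[which.max(trend)]
peak_episode
```

Reading the peak off the smoothed values like this is less error-prone than eyeballing the plot.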

Conclusion